-
Notifications
You must be signed in to change notification settings - Fork 97
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Parallelize the TableVectorizer
column-wise
#592
Conversation
It works on the latest sklearn version (1.2.2), but not on the older ones (I think it would be quite painful to make it work). Do we still want to support only recent sklearn versions? @GaelVaroquaux |
+1 for supporting only sklearn 1.2.2 and later (can we try 1.2.1 ?). Can someone update our requirements and CI build matrix? @LilianBoulard ? |
Cool ! It also works for 1.2.1 |
Something I don't really like is that the
but I'm not sure we can avoid it with this solution. |
@glemaitre do you have thoughts on this? Do you think it could be worth implementing the parallelisation directly in sklearn? |
I am not convinced that the strategy is beneficial for all types of transformers. To give a concrete example, I think it would be faster to run a So the proposed strategy would make sense if internally to the transformer, there is already an explicit I might be missing some details to grasp the reason to dispatch a column per transformer here. |
Thanks for your answer @glemaitre !
Agree! The idea would be to have a list of transformer classes for which it is useful.
But right now the transformers used inside the *Now that I think of it, it is indeed suprising to set the |
Yep, I see the issue regarding nested parallelism and oversubscription. But I don't know if we should special case or instead better handle the nested parallelism in Having the nested parallelism should be somehow manageable but require the user to set properly the I just thought now that UX-wise, having multiple instances of the same transformer for each column will not be user-friendly when it comes to inspection. We could always rebuild a single instance but it smells hackish from far away (and then there is the same issue regarding the parallelism at |
Yes I agree
If the user provides a |
It is what I meant :). |
TableVectorizer
column-wise
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey @LeoGrin, here is a couple of additional remarks :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey @LeoGrin, here is a couple of additional remarks :)
Co-authored-by: Vincent M <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We're almost done, here are some final comments
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, @LeoGrin, that's a great feature! Let's add some missing docstring, then LGTM :)
You have conflicts with the main branch |
Thanks for all the comments @Vincent-Maladiere ! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
some last docstring comments
Co-authored-by: Vincent M <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, LGTM!
Fix #586